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Abstract 

In this paper, we introduce a distributed algorithm that optimizes the Gaussian signal covariance matrices of multi¬ 
antenna users transmitting to a common multi-antenna receiver under imperfect and possibly delayed channel state 
information. The algorithm is based on an extension of exponential learning techniques to a semidefinite setting and it 
requires the same information as distributed water-filling methods. Unlike water-filling however, the proposed matrix 
exponential learning (MXL) algorithm converges to the system’s optimum signal covariance profile under very mild 
conditions on the channel uncertainty statistics; moreover, the algorithm retains its convergence properties even in the 
presence of user update asynchronicities, random delays and/or ergodically changing channel conditions. In particular, 
by properly tuning the algorithm’s learning rate (or step size), the algorithm converges within a few iterations, even for 
large numbers of users and/or antennas per user. Our theoretical analysis is complemented by numerical simulations 
which illustrate the algorithm’s robustness and scalability in realistic network conditions. 

Index Terms 

Imperfect CSI; matrix exponential learning; MIMO; signal covariance optimization; stochastic approximation. 

I. Introduction 

Following the seminal prediction that the use of multiple antennas in signal transmission and reception can lead 
to substantial performance gains [1, 2], multiple-input and multiple-output (MIMO) technologies have become an 
integral component of state-of-the-art wireless communication protocols (such as 3G LTE, 4G, HSPA+ and WiMax 
to name but a few). To capitalize on these gains, the emerging massive MIMO paradigm, a contending technology for 
5G networks, “goes large” by scaling up existing multiple-antenna transceivers through the use of inexpensive service 
antennas and time-division duplexing (TDD) in order to focus energy into ever smaller regions of space [3-5]. In so 
doing, massive MIMO arrays can increase throughput by a factor of lOx (or more), bring about significant latency 
reductions over the air interface, and greatly improve the system’s robustness to ambient noise [5, 6]. 
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Nevertheless, when coupled with the projected network densification (e.g. due to the massive deployment of small 
cells), the resulting antenna density is expected to create increased interference which will have to be mitigated through 
spatial focusing. To that end, it is crucial to optimize the input signal distribution of each user, especially in the 
moderate signal-to-interference-and-noise ratio (SINR) regime: in this way, wireless receivers can achieve higher 
transmission rates with the same power, thus increasing spatial spectrum reuse and improving their radiated energy 
efficiency and overall quality of experience (QoE). 

This optimization is typically achieved via water-filling (WF) methods [7-9] that rely on the transmitters having 
access to accurate channel state information (CSI), including their individual transfer matrices and the multi-user 
interference-plus-noise (MUI) that they are facing at the receiver. One way to collect the required CSI is to have the 
base station (BS) emit reverse-link pilot waveforms that allow each terminal to estimate the corresponding channel 
responses over a given frequency band. However, given that the number of responses that must be estimated at each 
terminal is proportional to the number of antennas at the BS, massive MIMO systems may require up to a hundred 
times more pilots than conventional MIMO systems [4]; as such, one of the most popular solutions is to operate in 
TDD mode and rely on uplink-downlink reciprocity for the exchange of CSI pilot feedback [5], 

In this context, a major challenge occurs when transmitters are required to optimize their input signal covariance ma¬ 
trices in the presence of imperfect and/or delayed CSI - e.g. due to the vastly increased impact of pilot contamination 
in massive MIMO systems [3, 10]. In the absence of perfect channel state information at the transmitter (CSIT), the 
convergence of WF methods is no longer guaranteed because the algorithms’ fixed point can be significantly perturbed 
by erroneous (or obsolete) CSI. As a result, the efficient deployment of massive MIMO systems calls for flexible and 
robust optimization algorithms that are capable of dealing with feedback uncertainty originating from temporal fading, 
lack of adaptation synchronicity, measurement errors due to noise and pilot contamination, etc. 

In this paper, we propose a distributed optimization algorithm based on the method of matrix exponential learning 
(MXF) that was recently introduced by the authors of [11, 12]. Essentially, rather than updating their signal covariance 
matrices, transmitters update the logarithm of these matrices based on (possibly imperfect) measurements of a matrix 
analogue of the transmitter’s SINR. The benefit of updating the logarithm of a user’s covariance matrix is that the 
algorithm’s updates only need to be Hermitian (and not necessarily positive-definite), so it is much easier to respect 
the problem’s semidefiniteness constraints. Furthermore, in contrast to WF methods, the proposed algorithm proceeds 
by aggregating CSI feedback over time: in this way, measurement errors, noise and asynchronicities effectively vanish 
in the long run thanks to the law of large numbers for martingales. In particular, the proposed algorithm has the 
following desirable attributes: 

1) It is distributed: user updates are based on local information and channel measurements. 

2) It is robust: measurements and observations may be subject to random errors and noise. 

3) It is stateless: users do not need to know the state (or topology) of the system. 

4) It is reinforcing: each user tends to increase his own rate. 

5) It is decentralized: user updates need not be synchronized or otherwise coordinated. 
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A good paradigm to test the performance and convergence properties of the proposed algorithm is the widely 
studied vector Gaussian multiple access channel (MAC) [8], This channel model is the MIMO equivalent of the 
parallel multiple access channel (PMAC) and consists of several (and mutually independent) MIMO transceivers that 
are linked to a common multi-antenna receiver. In this framework, assuming perfect CSIT, it is well known that 
iterative water-filling (IWF) converges to the system’s optimum transmit profile [8], but the algorithm’s convergence 
speed decreases proportionally with the number of transmitting users. On the other hand, simultaneous water-filling 
(SWF) is much faster, but it may fail to converge altogether: as was shown in [13], the sufficient conditions that 
guarantee the convergence of SWF methods may fail to hold even in simple 2x2 PMAC systems; To make matters 
worse, this situation is exacerbated in the case of imperfect CSI where even the convergence of IWF is no longer 
guaranteed; by contrast, the proposed MXL algorithm converges rapidly to the system’s optimum transmit profile 
(even for large numbers of users and/or antennas per user), and it remains convergent irrespective of the magnitude of 
the measurement noise. 

Our theoretical analysis relies on the powerful stochastic approximation methods of [14, 15] and the matrix reg¬ 
ularization machinery of [16, 17]. Specifically, we first establish the algorithm’s convergence in a continuous-time 
setting, and we then use the ODE method of stochastic approximation [14] and the theory of concentration inequalities 
for martingales [18, 19] to show that this convergence is retained in a discrete-time setting - even under noisy and 
delayed/asynchronous updates. 1 The algorithm’s main tunable parameter is its step-size which controls the rate at 
which users learn: picking larger step sizes accelerates the algorithm, but this acceleration comes at the expense of 
accuracy (a crucial trade-off in the presence of noise). Still, if the error process has finite p-th moments for some 
p > 20, convergence is guaranteed as long as the algorithm’s step-size y n satisfies yV P '~ < °°; in practice, the 
central limit theorem guarantees that estimators of Gaussian variables have finite moments of every order, so the 
algorithm’s step-size can be taken nearly constant, guaranteeing in this way relatively rapid convergence even in very 
noisy environments. In addition, we show that, under mild conditions on the moments of the error process, the tail 
behavior of the deviations is Gaussian - i.e. it is sharply concentrated around its mean. 

Paper outline and summary of results 

After introducing our system model in the next section, we derive the proposed matrix exponential learning algo¬ 
rithm in Section III-A (cf. Algorithm 1); in the same section, we also state our main convergence result in the presence 
of measurement noise and errors (Theorem 1). Since one of the main goals of this paper is to illustrate the convergence 
properties of MXL in the presence of various stochastic impediments. Section III-B provides a concrete example 
of feedback matrix estimation, while Section III-C extends the convergence results of Section III-A to the case of 
asynchronous user updates (Theorem 2). Section III-D describes the evolution of the eigenvalues and eigenvectors of 
the users’ signal covariance matrices (Theorem 3), while Section IV extends our results further to the case of fast- 

*For a related continuous-to-discrete descent, see the recent paper [20] on unilateral online optimization in dynamically varying cognitive radio 


systems. 
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fading channels (Theorem 4). Finally, our theoretical analysis is validated and supplemented by numerical simulations 
in Section V. 

To streamline the flow of the paper, proofs and technical details have been delegated to a series of appendices at the 
end. 


II. System Model 

Consider a Gaussian vector multiple access channel (MAC) where a finite set of wireless users k e'K = {1,.... AT} 
transmit simultaneously over a common channel to a base receiver with N antennas. If the k-th transmitter is equipped 
with Mi- transmit antennas, we get the familiar signal model 

Z K 

k=\ HfcXjt + z, (1) 

where: 

1) x* e C Mi is the message transmitted by user k e'K. 

2) y € C ,v denotes the aggregate signal at the receiver. 

3) H^. e C NxMt is the N x M k channel matrix of user k. 

4) z e is the ambient noise in the channel, including thermal, atmospheric and other peripheral interference 
effects (and modeled for simplicity as a zero-mean, complex Gaussian vector with unit covariance). 2 
In this context, the average transmit power of user k is simply 

Pk = E [||x*|| 2 ] = tr(Qt), (2) 

where Q/ denotes the user’s signal covariance matrix: 

Qa = E [x*xj], (3) 

and the expectation is taken over the Gaussian codebook of user k. Hence, assuming that each user’s maximum transmit 
power is finite, we obtain the feasibility constraints: 

Qa > 0 and tr(Q*) < P k , (4) 

where P k > 0 denotes the maximum transmit power of user k. 

The first part of our analysis focuses on static channels, i.e. the channel matrices Ha will be assumed to remain 
constant (or nearly constant) throughout the transmission horizon (fast-fading channels will be treated in Section IV). 
In this case, assuming single user decoding (SUD) at the receiver (i.e. interference by all other users is treated as 
additive colored noise), each user’s achievable transmission rate will be given by the familiar expression [2]: 

R k (Q) = log det (VV a + HaQaH 2 ) - log det (W_ t ), (5) 

2 Obviously, depending on the structure and statistical properties of the channel matrices H^, the signal model (1) could be applied to a wide 
variety of telecommunications systems, ranging from digital subscriber line (DSL) uplink networks with Tceplitz circulant H* to code division 
multiple access (CDMA) and/or frequency division multiple access (FDMA) radio networks [21]. For concreteness however, we will interpret (1) 
as an ad hoc multi-user MIMO multiple access channel with representing the channel of each link. 
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where Q = (Qi,...,Qk) and 


W_* = I + 2 fttt H / Q,Hj 


( 6 ) 


represents the multi-user interference-plus-noise (MUI) covariance matrix of user k. Thus, given that each user needs 
to saturate his power constraints in order to maximize his individual rate, we will say that a transmit profile Q = 
(QI.---.Qi) is at Nash equilibrium when no user can unilaterally improve his individual achievable rate A 1 /,, i.e. 


RkiQl > R k (Qk\ Q' k ) for all Q, e x k , keK, (NE) 

where (Q k ; Q*_ k ) is shorthand for (QJ,..., Q*,..., Q^) and 

X k = {Q* € C M ‘ XM ‘ : Q k > 0, tr(Q,) - P k \ (7) 


denotes the (compact, convex) set of feasible signal covariance matrices for user k. ' 

Dually to the above, if the receiver employs successive interference cancellation (SIC) techniques to decode the 
received messages, the users’ achievable sum rate will be [8]: 

R( Q) = log det (I + 2* H,Q,H;) . (8) 


In this way, we obtain the sum rate maximization problem: 


maximize R( Q), 


(RM) 


subject to Q k e X k , k = 

Importantly, as can be easily checked, the users’ sum rate (8) is a potential function for the game defined by (5) in the 
sense that 


Rk(Qk,Q-k) - Rk(Q' k ;,Q-k) = R(Qc Q-*) - R(Q' k , Q-*). 


(9) 


Hence, with R concave, it follows that the solutions of the Nash equilibrium problem (NE) coincide with the solutions 
of (RM); put differently, optimizing the users’ achievable sum rate (8) under SIC is equivalent to equilibrating the 
users’ individual transmission rates (5) under SUD. 

For concreteness, in the rest of this paper, we will focus on the sum rate maximization problem (RM); however, 
owing to the above observation, our results will obviously apply to the unilateral equilibration problem (NE) as well. 


III. Learning with Imperfect and Delayed Information 

The sum rate maximization problem (RM) is traditionally solved by water-filling (WF) methods [7], either iterative 
[8, 22] or simultaneous [23], More precisely, transmitters are typically assumed to have perfect knowledge of the 
channel matrices and the aggregate signal-plus-noise covariance matrix 

W = E[yyt] =I+2rHrQ f Hj, (10) 

3 Under an energy-aware objective, users would not need to saturate their power constraints, but such considerations lie beyond the scope of this 


paper. 
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which is in turn used to calculate the MUI covariance matrices W_£ = W - 1I/ Q/ FI, and “water-fill” the effective 
channel matrices H/. = W at the transmitter [8]. At a multi-user level, this water-filling process could take place 

either iteratively (with users updating their covariance matrices in a round robin fashion) [8] or simultaneously (with 
all users updating at once) [23]: the former updating scheme converges always (but slowly for large numbers of users) 
[8], whereas the latter is much faster [23] but may occasionally fail to converge, even in simple, 2-user parallel multiple 
access channels [13]. 

An added complication in the use of WF methods is that they rely on perfect channel state information at the 
transmitter (CSIT) and accurate measurements of W at the receiver (who can broadcast this information via a dedicated 
radio channel or as part of the TDD downlink phase). When such measurements are not available, it is not known 
whether WF methods converge; on that account, our goal in this section will be to describe a distributed learning 
method that allows users to attain the system’s sum capacity in a distributed way, using only imperfect (and possibly 
delayed) information. 


A. Matrix exponential learning 


Instead of relying on fixed-point methods, our approach will rely on tracking the direction of steepest ascent of the 
system’s sum rate in a dual, unconstrained space, and then map the result back to the problem’s feasible space via 
matrix exponentiation. Formally, assuming for the moment perfect CSIT, we will consider the MXL scheme: 


where: 


Y kin + 1) = Y k (n) + y„V*(Q(n)), 

n , , n p exp(Y*(n + 1)) 

Qk(n + 1 ) = Pk 


tr[exp(Y*(« + 1))]’ 


(MXL) 


1) V k = V/, (Q) denotes the (matrix) derivative of the system’s sum rate with respect to each user’s covariance matrix: 

V*(Q) = V Qt * = V q ,/L = HJW'H* (11) 

2) Y* is an auxiliary “scoring” matrix which tracks the direction of steepest sum rate ascent. 

3) y n is a decreasing step-size sequence (typically, y n = 1/n). 


Remark 1. Given that R( Q) is real-valued, V/. and Y* are automatically Hermitian, so the matrix exponential expfY/j 
is positive-definite, as required; the trace normalization in (MXL) then ensures that Q*(«) satisfies the feasibility 
constraints (7) of (RM) for all n > 1. In this way, (MXL) can be seen as a two-stage, “primal-dual” gradient method 
[24] which reinforces the spatial directions that lead to higher sum rates by allocating more power to the corresponding 
eigen-directions of the users’ covariance matrices. 


To account for imperfect CSIT and noisy measurements at the receiver, we will assume that the gradient matrices V 
of (11) are only known up to a noisy estimate V. In particular, we envision the following sequence of events: 

1) At every update period n = 1 , 2 ,..., each user k e 'K gets an estimate V/•(/;) of the true gradient matrix VAQ(n)). 

2) Users update their signal covariance matrices according to (MXL) and the process repeats. 

More concretely, this recurring process may be encoded in algorithmic form as follows: 
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Algorithm 1 Matrix Exponential Learning (MXL). 

Parameter: decreasing step-size sequence y„ 

Initialize: n <- 0; Y k <- 0; Q* <- ^ • I 

Repeat 

n <— n + 1; 

foreach user k e'K do 

receive estimate V* of V* = 

update score matrix: Y k <- Y k + y n Y k \ 

set covariance matrix: Q k <- P k • explY^/trlexplYi.)]; 
until termination criterion is reached. 


The MXL algorithm above will be the main focus of our paper, so a few remarks are in order: 

a) Implementation: From an implementation point of view, MXL has the following desirable properties: 

(PI) Distributedness: users have the same information requirements as in distributed water-filling [8, 22, 23], 

(P2) Robustness: the algorithm does not assume perfect CSIT or precise signal measurements at the receiver. 

(P3) Statelessness: users do not need to know the state of the system (e.g. its topology). 

(P4) Reinforcement: users reinforce the transmit directions that lead to higher transmission rates. 

b) Assumptions on the measurement errors: Throughout this paper, we will work with the following statistical 
hypotheses for the noise process Z k (n) = Y k (n) - Vj(Q(n)) ; 

(HI) Unbiasedness: 

E [Z(n + 1) | Q(n)] = 0. (HI) 

(H2) Finite mean squared error (MSE): 

E [ ||Z(« + 1)|| 2 | Q(n)] < o- 2 for some cr > 0. (H2) 

The statistical hypotheses above allow us to account for a very wide range of error processes: in particular, we will 
not be assuming independent and identically distributed (i.i.d.) errors, or even errors that are a.s. bounded. 4 In fact, 
Hypotheses (HI) and (H2) simply amount to asking that the gradient estimate V be unbiased and bounded in mean 
square: 

E [||Vjfc(n + 1)|| 2 | Q(?0] < V 2 k for some V k > 0, (12) 

and our convergence results will be stated with only this mild requirement in mind. 

That being said, we obtain even sharper results in some cases under the additional hypothesis: 

(H3) Conditionally symmetric error distributions: the law of Z (n + 1) given Q(n) is symmetric. 

Hypothesis (H3) also applies to the vast majority of centered processes (uniform, Gaussian, Levy a-stable, Laplace, 
etc.) and we will use it to further refine our convergence results. 

4 This observation is crucial in the context of wireless networks because measurement errors are typically correlated with the state of the system. 
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c) The estimation process: As stated. Algorithm 1 does not detail how the system’s users can obtain an unbiased 
estimate of V* from individual CSI observations and measurements of the received signal covariance matrix W. To 
simplify our presentation, we will state our convergence results below under the assumption that there is an oracle-like 
mechanism that returns such estimates to the users on request; the construction of such a mechanism will then be 
detailed in Section III-B. 

Remark. We should also note here that if the gradient estimates V/ ; are not Hermitian, the score matrices Y will not 
be Hermitian either so the users’ covariance matrices may fail to be positive-definite. To avoid such complications, it 
suffices to take [V + V']/2 of V as an estimator for V; for simplicity, we will tacitly assume that this hermitization step 
has already taken place if necessary. 

d) Complexity per iteration: From a computational standpoint, it is easy to see that the complexity of each 
iteration of Algorithm 1 is polynomial (with a low degree) in the number of transmit and receive antennas (for 
calculations at the transmitter and receiver side respectively). Specifically, the complexity of the required matrix 
inversion and exponentiation steps is 0(N U ) and 0(M'“) respectively, where the exponent a> is as low as 2.373 if the 
processing units employ fast Coppersmith-Winograd methods for matrix multiplication [25]. 5 The Hermitian structure 
of W can be exploited to reduce the computational cost of each iteration even further, but such issues lie beyond the 
scope of this paper; in practice, the number of transmit and receive antennas are physically constrained by the size of 
the wireless array, so these operations are quite light. 

With all this in mind, our main result is as follows: 

Theorem 1. Assume that the MXL algorithm (Alg. 1 ) is run with nonincreasing step-sizes y„ such that 7 « < 
2 n 7n - 00 an d noisy measurements \(n) satisfying hypotheses (HI) and (H2). Then, Q (n) converges almost surely to 
arg maxg R( CD- 

More generally, let R n — 2™=i JjRjl Z" = i Jj denote the time average of the users’ sum rate with respect to an 
arbitrary nonincreasing step-size sequence y n . Then: 


0 

E ^ ^max &id 

(13) 

ii) 

IP (fimax - Rn > S„ + z) = 0 (a 2 Z~ 2 t~ 2 £" = 1 rf) 

(14) 

where /? max 

= maxQR(Q) is the system’s sum capacity, t n = YIj=\ Jj> an d 



sn = f,; 1 [zf = i log M k + \l 2 2" =1 y]\ - 

(15) 


denotes the algorithm’s mean performance guarantee at the n-th update period (in the above. Mi is the number of 
transmit antennas of user k and L? = Yrk= l ^l^i * s a positive constant depending on the users’ maximum transmit 
powers Pi and the mean square Vj: of their measurements). 

5 In particular, the complexity of each iteration of Algorithm 1 is that of matrix multiplication. 
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Finally, if (H3) also holds, the concentration of the algorithm around its mean value at a given iteration is 
exponential: 

- log P (ft max - R„ >e„+z)=0 ( t;,z 2 cr~ 2 / 2" =1 yj) (16) 

Proof: See Appendix B. ■ 

In what follows, we discuss some important points regarding Theorem 1: 

a) On the choice ofy n : The use of a decreasing step-size sequence y n in (MXL) might appear counter-intuitive 
because it implies that new gradients enter the algorithm with decreasing weights (after all, intuition suggests that one 
should put more weight on recent observations rather than older, obsolete ones). However, if the algorithm has reached 
a near-optimal point, a constant step size might cause it to overshoot and miss its mark: this can be seen clearly from 
the mean error bound (15) which does not vanish as n —> oo for constant step sizes of the form y„ = y. 

As a rule of thumb, the use of a (large) constant step size speeds up the algorithm but may also lead to unwanted 
oscillations towards the end because it does not dissipate measurement noise and discretization errors: if the system’s 
users seek to eliminate such phenomena, a decreasing step size should be preferred instead. In fact, to obtain the best 
of both worlds, we can consider an adaptive schedule where the method’s step-size is initially very high (to accelerate 
the search of the problem’s state space), and is then abruptly decreased when oscillations are detected; we test such 
schedules in Section V. 

We should also note here that the requirement y\ < oo for almost sure convergence is tied to the finite MSE 
hypothesis (H2) and can be relaxed significantly if there are sharper bounds on the central moments of the estimator 
V. For instance, if the error process Z (n) has finite p-th moments (cf. Hypothesis (H2') below), the analysis of [14] 
shows that the summability condition f n y\ < oo can be replaced by the much lighter requirement f,, y ] „ +p '~ < oo 
which allows us to use considerably faster step-size sequences. 

b) Convergence rate: If users employ a constant step-size sequence yj = y for a number of iterations n that 
infixed in advance (using “for” instead of “while”), an easy calculation shows that the minimum value of the mean 
guarantee (15) is attained for yj = y = L~ l y/2 log M^/n and is equal to: 6 

/2 2 i logM i . 

e n = L \ -■ (17) 

V n 

Put differently, the mean performance guarantee (15) with constant step size falls below e > 0 in n — 2Lr log M^/s 2 = 
0(Ks 2 ) iterations. 

That being said, this guarantee concerns the empirical average of the system’s sum rate R n = n 1 f , R s and not 
the users’ instantaneous sum rate R„. Empirical averages evolve much slower than actual values and, as we show in 
Section V, the users’ instantaneous sum rate R„ increases much faster and converges to the system’s sum capacity 
within a few iterations (even for very large numbers of users and/or antennas per user). 

6 Note also that R„ is then given by the standard expression R„ = n 1 2" =1 Rj- 
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c) Large deviations and outage probabilities: Eq. (14) represents the probability of observing sum rates far 
below the channel’s capacity, so it can be interpreted as a measure of the system’s outage probability under (MXL). 
As such, the tail behavior of (14) shows that MXL hardens considerably around its deterministic limit: even though 
measurement errors can become arbitrarily large (note that a finite MSE does not imply a.s. bounded errors), the 
probability of observing sum rates much lower than what is obtainable with perfect gradient measurements decays 
very fast - exponentially so if (H3) holds. In particular, for large n, the step-size factor t~ 2 Y!)=\ Tj which controls the 
width of non-negligible large deviations (the parameter z) in (14) is of order 0(1/n) for step-size sequences of the form 
y n oc ,r a , a e (0,1/2), and of order 0(n 2a - 2 ) for a e (1/2,1). 

Finally, we should also note that the symmetry condition (H3) is not at all necessary to obtain the exponential rate 
of decay (16). As we show in Appendix B, the same bound is also obtained under the following slight reinforcement 
of (H2): 

(H2') Subexponential error moment growth: 

E [||V(n + 1) - V(Q(« + l))f | Q(n)] < y a 9 (H2') 

for some cr > 0 and for all p e N. 

This hypothesis is also satisfied by a wide range of probability distributions (including all distributions with finite 
support, Gaussian, sub-Gaussian and sub-exponential tails) and it offers a useful alternative to (H3) when the only 
thing that can be estimated is the errors’ raw moments. 

d) Links with matrix regularization: The proof of Theorem 1 relies on stochastic approximation techniques [14] 
and a deep connection between matrix exponentiation and the von Neumann quantum entropy. In fact, as we show 
in the appendix, (MXL) is closely related to the matrix regularization techniques of [16, 17, 20] for online learning 
and the mirror descent machinery of [24, 26] for (stochastic) convex programming. In particular, the “convergence- 
in-the-mean” bound (13) is derived in the same way as the corresponding results of [24, 26], but the techniques 
developed therein do not suffice for the much stronger almost sure convergence result that we present here. An in- 
depth description of mirror descent methods requires the introduction of considerable technical apparatus from convex 
analysis and lies beyond the scope of this paper; for a comprehensive account, see instead [24, 26] and references 
therein. 

B. Unbiased gradient estimators 

As we mentioned before, the MXL algorithm requires users to have access to unbiased estimates V* of their 
individual gradient matrices (11). Our goal in this section will be to describe a process with which the receiver and the 
transmitters may estimate V based at each step on possibly imperfect signal and channel measurements - e.g. obtained 
through the exchange of pilot feedback signals. 

The first step will be to estimate the aggregate signal precision (inverse covariance) matrix 

P = Etyy+r 1 = W- 1 = (i + HfQ f Hj) 


( 18 ) 
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by sampling the signal y e C' v at the receiver. To that end, since the channel is Gaussian, an unbiased estimate for the 
received signal covariance W = E[yy^ ] can be obtained from a systematically unbiased sample {y s }^ =1 of size S by 
means of the well-known estimator W = j 2, =1 y s yi. 7 However, W 1 is a biased estimate for W 1 , so inverting W 
directly would introduce a systematic error in P. Instead, following [27], we will consider the bias-adjusted precision 
matrix estimator: 


P = 


S -N- 1 
S 


W 


(19) 


where N is the number of antennas at the receiver (the dimension of y) and W = S 1 Eli y*yJ as before. 

Consequently, in the absence of perfect channel state information at the transmitter, the transmitters must estimate 
the individual gradient matrices V'/ t = H, W ! II/ from the broadcast of P and using imperfect measurements of their 
channel matrices H*. To that end, if each transmitter takes S independent measurements Hu, • • •, of his channel 
matrix (e.g. via independent pilot sampling), an unbiased Hermitian estimate for V* is given by the expression: 


V* = 


— 1 —y 

S(S - 1) ^ 


Hi PH 


k,S ': 


( 20 ) 


where P is the latest estimate of (19) of W 1 that was broadcast by the receiver. 8 Indeed, given that the sampled 
channel matrix measurements H* :>J are assumed stochastically independent, we readily obtain: 


E[V*] - 


—y 
- 11 


S(S - 1) t-i***> 


E [H PH, 

,! L k,s 1 


■k,s f 


= H W 'Hi, 


( 21 ) 


i.e. (20) constitutes an unbiased estimator of V. 

The construction above provides an estimator V with E [V] = V, so Assumption (HI) holds. As for the variance of 
V, (20) can also be used to derive an expression for Var(V) in terms of the moments of P and H. Since the system input 
and noise are assumed Gaussian, the former are all finite (and Gaussian-distributed) so the finite mean square error 
hypothesis (H2) boils down to measuring H with finite mean squared error - a requirement which is easy to achieve. In 
fact, if the sample size S is taken large enough, the central limit theorem guarantees significant control on the variance 
and higher central moments of V, so even the tighter requirements (H2') and (H3) hold. 


Remark 2. Under time-division duplexing (TDD) operation, the estimation process above is greatly facilitated because 
the estimates of each user’s channel matrix H/ can be calculated directly at the transmitter via reverse-link pilots. In 
the absence of TDD (or in the case where the number of antennas at the receiver is massively large, viz. N 2 » Af?), 

a more efficient solution would be to feed back to each transmitter an unbiased estimate of Vt that is calculated directly 
at the receiver. 


C. Asynchronous updates and delays 

Even though the information requirements of (MXL) are local in nature, the algorithm itself is not fully decentralized 
because it relies implicitly on a global timer to coordinate the users’ update schedule (similarly to IWF and SWF 

7 Since the expected value E[y] = 0 of y is known in advance, we do not need to include an .S’/(.S' - 1) bias correction factor in the estimate of W. 

8 Obviously, increasing the sample size S decreases the estimator’s MSE at the cost of computational complexity. 
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methods). To overcome this limitation, we examine here a fully decentralized variant of Algorithm 1 where each user 
updates his signal covariance matrix based on an individual timer and completely independently of other users. 

Of course, in this case, the measurements V/ c may suffer from delays and asynchronicities, so the update structure 
of Algorithm 1 must be suitably modified. To that end, let n denote the n -th overall update period in the system, let 
c 'K denote the subset of users who update at this epoch (typically \'K n \ — 1 if users update at random times), and 
let dk(n) be the number of periods that have elapsed since the last update of the k-th user. We then obtain the following 
asynchronous variant of (MXL): 


Y k (n + 1) = Y k (n) + y ril l(k e %,) ■ V*(n), 
n , , n p exp(Y/t(n + 1)) 

Q k(n +l)-Pk- 


(MXL-a) 


tr[exp(Yfc(n + 1))]’ 

where n k = YI)=\ 1(£ e ( Kf) denotes the number of updates that have been performed by user k up to epoch n and Y k 
is a noisy (and asynchronous) estimate of (11): 


Yk(n) = Hj 




H k + Z k (n). 


( 22 ) 


with Z (n) satisfying (HI) and (H2) as before. 

By definition, Y k (n) and Q/ (/i) are updated at the (n + l)-th update period if and only if k e 'K n , so every user 
only needs to keep track of his individual update timer. In this way, we obtain the following decentralized variant of 
Algorithm 1 (shown here for a single, focal transmitter and with step sizes of the form y n = y/n for simplicity): 


Algorithm 2 Asynchronous exponential learning (MXL-a). 
Parameter: y > 0. 

Initialize: n <- 0; Y <- 0; Q ^ ■ I 

Repeat 

foreach UpdateEvent do 

n <— n + 1; 

receive estimate V of V; 

update score matrix: Y <- Y + y/n ■ V; 

set covariance matrix: Q <- P exp(Y)/tr[exp(Y)]; 
until termination criterion is reached. 


Remarkably, in this asynchronous context (with delayed, imperfect measurements), we still get: 

Theorem 2. Assume that the users’ delay processes d k (n) are bounded ( a.s .) and that the set of users 7C„ that update 
at step n is a homogeneous recurrent Markov chain - i.e. every user updates at a positive rate. Then, the iterates of 
Algorithm 2 converge almost surely to arg maxg R( Q). 


Proof: See Appendix C. 
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Obviously, Algorithm 2 enjoys the same implementation properties as Algorithm 1, and, in addition: 

(P5) Asynchronicity: there is no need for a global update timer to synchronize the network’s wireless users. 

In particular, the criteria that trigger an UpdateEvent could be completely arbitrary, so (MXL-a) is more suitable for 
scenarios where there can be no coordination between the transmitters’ update periods. Otherwise, if an UpdateEvent 
is triggered simultaneously (e.g. if it is triggered by the receiver’s broadcasts). Algorithm 2 reduces to synchronous 
MXL (Alg. 1): in this case, the algorithm’s convergence can be greatly sped up because more users update per period. 

D. Eigenvalues, eigenvectors and learning 

We close this section by describing the evolution of the users’ transmit eigenvectors and eigenvalues under (MXL) 
and using this description to propose an alternative implementation of Algorithm 1 . The key ingredient of our analysis 
is the following proposition: 

Proposition 1. Let {qkai^ka}™^ be a smooth eigen-system for Q k and let V k ^ = \J ka X k \i k p. Then, the iterates of 
Algorithm 1 track the mean dynamics: 



(23a) 



(23b) 


Proof: See Appendix C. 


The precise sense in which Q (n) “tracks” the mean dynamics (23) is explained in Appendix A; for our purposes, the 
most important consequence of Proposition 1 is that (23) leads to the following variant of Algorithm 1 : 

Algorithm 3 Eigen-based exponential learning (MXL-eig). 

Parameter: decreasing step-size sequence y n 
Initialize: n <— 0; q ka \ u to 

Repeat 

n <— n + 1; 

foreach user k e W do 
measure V*; 

update eigenvalues: q ka <- q ka + y n q ka ( V k ao - P ji 1 2^\ qkgV k ^\ 
update eigenvectors: <- u ka + y„ 2/?*a V^ ff (logry to - log q kp )~'\i kp \ 

correct roundoff errors: u <— Orthonormal(u); 

_ set covariance matrix Q k ^ 2^, qka 
until termination criterion is reached. 


As in the case of the original MXL algorithm, we then obtain: 
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Theorem 3. Assume that Algorithm 3 is run with sufficiently small, nonincreasing step-sizes y„ such that y~ < 
Tjh 7n — °°- Then, Q («) converges almost surely to argmaxg R{ Q). 

We close this section with a few remarks on Algorithm 3: 

a) The orthonormalization step: Even though the eigenvector dynamics (23b) preserve orthonormality. Algo¬ 
rithm 3 introduces an 0{y^) round-off error to orthogonality due to discretization. The call to Orthonormal performs a 
basis orthonormalization and it is intended to correct that error in order to yield a covariance matrix Q/ that satisfies the 
feasibility constraints (7) of (RM): like matrix exponentiation, orthonormalization has the same complexity as matrix 
multiplication (fast Coppersmith-Winograd methods [25] provide an <9(M~ 373 ) bound), so this does not increase the 
algorithm’s (polynomial) complexity. 

b) Noisy measurements: Theorem 3 has been stated for simplicity for noiseless measurements. In the case of 
noisy measurements, the step-size of the algorithm must be tuned adaptively so that q^ a > 0; it is not hard to do so by 
using the random step-size techniques of [14], but we chose to focus on the noiseless case for presentational clarity. 


IV. The Case of Fast-Fading Channels 


In the presence of fading, the users’ channel matrices evolve stochastically over time at a rate which is much 
faster than the characteristic length of each transmission block; as a result, the static sum rate function R of (8) is 
no longer relevant. In this case, the users’ achievable sum rate for fixed Q under fast fading is given by the ergodic 
average [2, 28]: 

flerg(Q) = Eh [log det (i + H,Q,H 7 )|, (24) 


where the expectation is now taken with respect to the law of H (assumed here to follow a stationary, ergodic process). 
Accordingly, we obtain the ergodic rate maximization problem for fast-fading channels: 

maximize R e rg(Q), 


(ERM) 


subject to Q* e Xk , k = 1,..., K, 

where the users’ feasible sets Xk are defined as in (7): Xk = (Qi > 0 : tr(Q,) - P t }. 9 

Since expectation preserves convexity, the ergodic rate maximization problem (ERM) remains concave - in fact, it 
is straightforward to show that (ERM) is strictly concave [29]. However, given that the integration over the law of H 
is typically impossible to carry out, calculating the ergodic gradient V erg = VR erg of R ag is a likewise impractical task. 
Thus, instead of relying on intricate analytic calculations (that require substantial computation capabilities and a good 
deal of knowledge regarding the channels’ statistics), we will consider the same sequence of events as in the case of 
static channels: 

1 ) At every update period n = 1,2,... , each user k e'K gets an estimate ’Vk(u) of the matrix 


Vk(n) = H fa) 


I +Y, e Ht(n)Qt(n)H\(n) 


H k (n). 


(25) 


9 Perhaps more appropriately for the case of interest, the above expression also holds over the long term for block-fading channels, in which case 
the channel is essentially fixed over each transmission length during which the instantaneous rate is used. 
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where denotes the instantaneous realization of the channel matrix of user k at period n. 

2) Users update their signal covariance matrices according to the recursion (MXL) and the process repeats until a 
termination criterion is reached. 

Formally, writing Zk(n) = V/ ( («) - E [VA'OI for the difference between the users’ observed estimate V/ ; (/ij and 
the expected value of (25), we will make the same statistical hypotheses for Z as in the static regime - though we 
should note here that fluctuations are now due to both measurement errors and the channels’ inherent variability. 
Quite remarkably, despite the change of objective function, we obtain the following convergence result for fast-fading 
channels: 

Theorem 4. Assume that Algorithm 1 is run with nonincreasing step-sizes y„ such that y\ < y„ = oo and noisy 
measurements V (n) satisfying hypotheses (HI) and (H2) with respect to (25). Then, Q(«) converges almost surely to 
the solution of the ergodic rate maximization problem (ERM); moreover, the conclusions of Theorem 1 for an arbitrary 
nonincreasing step-size sequence y n also hold with the static sum rate R replaced by the ergodic sum rate R erg . 

Proof: See Appendix D. ■ 

In view of Theorem 4, we see that Algorithm 2 enjoys the additional property: 

(P6) Flexibility: the MXL algorithm can be applied “as-is” in both static and fast-fading channels. 

In particular, the same convergence rate and large deviation estimates that were derived for static channels in the 
previous section (cf. the remarks following Theorem 1) also carry over to the fast-fading regime - as does the analysis 
of Secs. III-B, III-C and III-D for the users’ measurement process and for the asynchronous and eigen-based variants 
of MXL respectively. The only difference here is that the variance cr that appears e.g. in (14) and (16) is not only due 
to imperfections in the estimation process of V, but also stems from the inherent variability of the system’s channels 
due to fast-fading. We will explore this issue in the following section. 

V. Numerical Simulations 

To validate the theoretical analysis of the previous sections and to assess the performance of the MXL algorithm 
(Alg. 1) in practical scenarios, we conducted extensive numerical simulations from which we illustrate here a selection 
of the most representative cases. 

First, in Figure 1 , we investigate the convergence speed of MXL as a function of the number of wireless transmitters, 
using existing water-filling (WF) methods as a benchmark. To that end, we simulated a multi-user uplink MIMO 
system consisting of a single receiver with N = 24 antennas and different numbers of wireless transmitters, each with 
a random number of transmit antennas, chosen randomly between 2 and 8 (due to space limitations, we only present 
here the case of K = 20 and K — 50 users). The users’ channel matrices H/ were then drawn from a complex Gaussian 
distribution at the outset of the process and were kept fixed throughout the transmission horizon. 10 

10 For simplicity, we neglect effects of path-loss in this simulation. This treatment is reasonable if we assume that users ‘'invert” their pathloss 
function by appropriately compensating their total transmission power level P/, [30], 
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(a) Normalized throughput over time for K = 20 users. (b) Normalized throughput over time for K = 50 users. 

Fig. 1. Comparison of matrix exponential learning (MXL) to water-filling (WF) methods. The classical IWF algorithm converges relatively slowly 
(roughly within 0(K) iterations) because only one user updates per cycle; the SWF variant is much faster (because all users updates simultaneously), 
but it may fail to converge due to the appearance of best-response cycles in the update process. By contrast, we see that the MXL algorithm converges 
within a few iterations, even for large numbers of users. 


For comparison purposes, we ran Algorithm 1 with constant step size alongside the classical iterative water-filling 
algorithm proposed in [8] and the simultaneous variant of [23], initializing all methods with a uniform power allocation 
profile that assigns the same power to all transmit antennas (the benchmark case). The algorithms’ performance over 
time was then assessed by plotting the normalized throughput gain 

r n = R„/Ro, (26) 

where R„ denotes the system’s sum rate at the 77 -th iteration of the algorithm, and the benchmark rate R<, denotes the 
system’s sum rate when all users employ a uniform beamforming policy. 

As can be seen in Fig. 1, the MXL algorithm attains the system’s sum capacity within a few iterations (effectively, 
within a single iteration for K - 50 users). 11 This convergence behavior represents a marked improvement over 
traditional WF methods, even in moderately-sized systems with K = 20 users: on the one hand, IWF is significantly 
slower than MXL (it requires O(K) iterations to achieve the same performance level as the first iteration of MXL), 
whereas SWF may fail to converge altogether due to “ping-pong” effects where users overcommit to a given spatial 
direction by diverting too much power to it all at once (thus “congesting” it) and then switch transmit directions in an 
effort to exploit the ensuing spatial gap (thus relinquishing the benefits from employing it in the first place). 12 

In Figure 2, we investigate the robustness of MXL under imperfect signal and channel measurements, and we 
compare it to iterative and simultaneous WF methods under similar conditions. Specifically, in Fig. 2, we simulated a 
multi-user uplink MIMO system consisting of a single receiver with N — 24 antennas and K = 20 wireless transmitters 
with antenna and channel characteristics as in Fig. 1. To simulate noisy measurements, we used the estimation scheme 

11 Alternatively, in the game-theoretic context of (NE), this implies that the system’s users reach a unilaterally stable Nash equilibrium. 

'-For a detailed account of best-response cycles of this kind, see e.g. [31]. 
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Normalized throughput over time (with measurement errors) Normalized throughput over time (with measurement errors) 



(a) Learning with a relative error level of 10%. (b) Learning with a relative error level of 50%. 

Fig. 2. Performance of matrix exponential learning and water-filling (WF) methods under imperfect CSI. In contrast to WF methods, the MXL 
algorithm attains the channel’s sum capacity, even in the presence of very high measurement errors. 


of Section III-B where the received signal precision matrix P = E |yy f ] 1 of Eq. (18) is estimated by sampling the 
aggregate signal y at the receiver and then feeding the unbiased sample mean (19) to the transmitters. The measurement 
noise was controlled by the relative error level of the estimator (deviation/mean), so a relative error level of rj means 
that, on average, the estimated matrix lies within q% of its true value (in terms of the Frobenius matrix norm). We then 
plotted the efficiency of MXL over time for average error levels of rj = 10% and 77 = 50%, and we ran the iterative and 
simultaneous WF algorithms with the same sample realizations for comparison. 

As can be seen in Figure 2, the performance of water-filling methods remains acceptable at low error levels, allowing 
users to attain between 90% and 95% of the channel’s sum capacity. However, when the measurement error level gets 
higher, water-filling (either iterative or simultaneous) offers no perceptible advantage over the users’ initial signal 
covariance matrices (the benchmark case of uniform power allocation across antennas). By contrast, as predicted by 
Theorem 1, the MXL algorithm retains its convergence properties and converges to the system’s sum capacity, even 
under very noisy measurements - though, of course, the algorithm’s convergence speed is negatively impacted when 
the measurement noise grows too high. 

Finally, to account for time-varying channel conditions, we also plotted the performance of the proposed MXL 
algorithm in non-static channels, following the well-known Jakes model for Rayleigh fading [32]. Specifically, in 
Figure 3, we consider a MIMO uplink system consisting of a receiver with N = 8 antennas and K = 10 mobile 
users with 2 transmit antennas, each transmitting at a central frequency of / = 2 GHz. For the users’ mobility model, 
we used the extended pedestrian A (EPA) and extended vehicular A (EVA) models [33] with average user velocities 
v = 5 m/s (pedestrian movement) and v = 15 m/s (vehicular movement in traffic-heavy urban environments). We then 
ran the MXL algorithm with an update period of 6 - 5 ms (one update per frame), and we plotted the algorithm’s 
achieved sum rate R n at the n -th iteration of the algorithm versus a) the system’s sum capacity R'' } ax given the current 
realization of the channel matrices (f) at time t — ri6; and b) the users’ sum rate under uniform power allocation over 
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Sum Rate under Rayleigh Fading (v = 5 km/h, f = 2 GHz, step = 5 ms) 


Sum Rate under Rayleigh Fading (v = 15 km/h, f = 2 GHz, step = 5 ms) 



Time (ms) 


Time (ms) 


(a) Performance of MXL with average user velocity u = 5 m/s. 


(b) Performance of MXL with average user velocity v = 15 m/s. 


Fig. 3. Data rates achieved by MXL in a dynamic environment with time-varying channels following the Jakes model for Rayleigh fading (model 
parameters indicated in the figure caption). The dynamic transmit policy induced by the MXL algorithm allows users to track the system’s sum 
capacity remarkably well, even under rapidly changing channel conditions. 


their antennas. Thanks to its high convergence speed and its robustness, the MXL algorithm tracks the system’s sum 
capacity remarkably well, despite the channels’ variability. Moreover, the throughput difference between the learned 
transmit covariance profile and the uniform one shows that this tracking is not an artifact of the system’s sum capacity 
falling within a narrow band of what could be attained by spreading power uniformly over antennas: users actively 
track the system’s optimum transmit profile as it evolves over time, even under rapidly varying channel conditions. 

VI. Conclusions and Perspectives 

In this paper, we introduced a distributed signal covariance optimization algorithm to maximize the uplink sum 
capacity of multi-antenna users that transmit to a common multi-antenna receiver with only imperfect, possibly 
delayed and asynchronously updated channel state information at the users’ disposal. Under fairly mild hypotheses 
for the moments of the statistics of the estimation imperfections, we showed that the proposed matrix exponential 
learning (MXL) algorithm converges rapidly, even for large numbers of users and/or antennas per user; moreover, the 
probability that the algorithm deviates beyond a small error from the optimum after a fixed number of iterations is 
very small (and decays exponentially if the moments of the error process do not grow too fast). In our view, these 
robustness properties of MXL make it an attractive alternative to water-filling methods that may fail to converge 
altogether in the presence of measurement noise. This is confirmed by extensive numerical simulations which exhibit 
the fast convergence and robustness properties of MXL in realistic channel conditions. 

We focused on the MIMO multiple access channel only for simplicity. The proposed algorithm can be readily 
extended to a MIMO-OFDM framework, different precoding schemes (such as MMSE or ZF-type precoders) or to 
account for other transmission features such as spectral mask constraints, pricing, etc. The method can also be adapted 
to wide range channel models (such as the interference channel), where a game-theoretic approach as in [9, 22] is more 
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appropriate: in this context, a natural question that arises is whether the algorithm converges to a Nash equilibrium, 
and whether this convergence is retained in the presence of noise. In fact, thanks to the exponentiation step, our method 
can be adapted to even more general constrained matrix optimization problems as in [34, 35]; we intend to explore 
these directions in future work. 


Appendix 
Technical Proofs 

Our goal in this appendix will be to prove the convergence results presented in the rest of our paper. To that end, 
we will first establish the convergence of a deterministic, “mean-field” dynamical system associated to the MXL 
algorithm, and we will then show that the iterates of MXL and its variants comprise a stochastic approximation thereof 
[14], Our robustness results will then follow by exploiting the powerful martingale concentration inequalities of [19]. 

For notational clarity, in the rest of this appendix (and unless explicitly stated otherwise), we will treat the case of a 
single user with maximum transmit power P — 1; the general case is simply a matter of taking a direct sum over k e 'K 
and rescaling by the corresponding maximum power P L 

A. The mean dynamics of exponential learning 

We begin by considering the following continuous-time version of the basic recursion (MXL): 

Y = V, 

_ exp(Y) (MXL-c) 

Q " tr[exp(Y)]’ 

The following proposition shows that (MXL-c) converges to the solution set of the rate maximization problem (RM): 
Theorem 5. Let Q(r) be a solution orbit of (MXL-c). Then, Q(f) converges to a minimizer of (RM). 

A key ingredient in our proof will be the (negative) von Neumann quantum entropy: 

h( Q) = tr[QlogQ], Q e Q, (27) 

and its convex conjugate (Legendre transform): 13 

h\ Y) = max QeQ [tr[YQ] - h(Q)\ = log tr [exp(Y)], (28) 

where Q denotes the (compact) spectrahedron: 

<3 - {Q ^ 0 : tr(Q) = 1}. (29) 

It will also be convenient to introduce the following “primal-dual” coupling between Q and Y: 

F( Q, Y) = h(Q) + h*(Y) - tr [QY]. (30) 

1 J The convexity of h(Q) is well known [36], To derive the expression (28) for its conjugate, simply note that h(Q) = 2 jlj l°g tj where qj are 
the eigenvalues of Q: Eq. (28) then follows from the expression h*(y) = log 2 j expfy f ) for the Legendre transform of the (negative) Gibbs entropy. 
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Eq. (30) gathers all the terms of Fenchel’s inequality [37], so, following [38], we will refer to it as the Fenchel coupling 
between Q and Y. Below, we present some key properties of F: 


Lemma 1. With notation as above, we have F{ Q, Y) > 0 with equality iff Q = exp(Y)/ tr[exp(Y)]. Moreover: 


and 


V y /z*(Y) 


exp(Y) 
tr [exp(Y)] ’ 


V y E(Q,Y) 


exp(Y) 

tr[exp(Y)] 


(31) 


(32) 


Proof: By standard matrix analysis results [39], we have: 

V Y A*(Y) = tr \vM Vy tr[exp(Y)] = (33) 

tr[exp(Y)] tr[exp(Y)] 

and (32) follows trivially. The first part of our claim is then a consequence of the general theory of convex conjugation 
- see e.g. [37, Chap. 26]. ■ 


Proof of Theorem 5: Our proof relies on the fact that the Fenchel coupling H(t ) = F( Q*, Y(t )) is a Lyapunov 
function for (MXL-c) for every maximizer Q* of (RM). Indeed, the definition of the Fenchel coupling and Lemma 1 
yield 

H = tr [V y /z*(Y) ■ Y] - tr [Q*Y] = tr [(Q - Q*) ■ V Q R] < 0, (34) 


where the inequality in the last step follows from the concavity of R and the fact that Q* is a maximizer of R. Moreover, 
equality in (34) holds if and only if Q is also a maximizer of R, so H is a Lyapunov function for (MXL-c) with respect 
to arg max R. 

The above reasoning shows that (MXL-c) converges to arg max R, but since R is not necessarily strictly concave, 
this does not imply that every trajectory of (MXL-c) converges to a specific point in arg max R. To show that this is 
indeed the case, let Q(f) be an orbit of (MXL-c) and let Q* be an w-limit of Q(f), i.e. Q (t„) — > Q* for some increasing 
sequence t n —> oo. By Lemma 1, this implies that F(Q*, Y(f„)) —> 0, so, since F(Q*,Y(t)) is nonincreasing, we also 
get lim^oo F(Q*,Y(t)) = 0. We conclude that Q(f) — » Q* (again by Lemma 1) and our proof is complete. ■ 


B. Stochastic approximation and convergence 

We now proceed to show that the iterates of the MXL are asymptotically close to solution segments of (MXL-c) of 
arbitrary length - more precisely, that they comprise an asymptotic pseudotrajectory (APT) of (MXL-c) in the sense 
of [14], 

Proposition 2. Assume that (MXL) is run with a nonincreasing step-size sequence y n such that yj t < £ n y n = 
+oo and noisy measurements Yk satisfying (HI) and (H2). Then, the iterates Q (n) of (MXL) form an asymptotic 
pseudotrajectory of (MXL-c). 

Proof: Simply note that the recursion (MXL) can be written in the form: 


Yin + 1) = Y(n) + y n [V(Q(«)) + Z(«)]. 


(35) 
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Since the map Y Q is Lipschitz 14 and the rate function R( Q) is smooth over the compact spectrahedron <3, it follows 
that the map Y 1-2 V(Q(Y)) is Lipschitz and bounded. Our claim then follows from Propositions 4.2 and 4.1 in [14]. 

■ 

With all this said and done, we are finally in a position to prove Theorem 1: 

Proof of Theorem 1: Let <3* = argmaxQ eQ /?(Q) denote the solution set of (RM) and assume ad absurdum that 
Q (n) remains a bounded distance away from <3*. Furthermore, fix some Q* 6 <3* and let D„ = F(Q. Y(n)); a Taylor 
expansion of F then yields: 

Dn+I = F(Q\Y(n + 1)) = F(Q\Y(n) + y„V(«)) 

< D„ + y„ tr[(Q(«) - Q*)-V(Q(n))] + y„& + ^y,TllV(»)l| 2 , (36) 

where £ n = tr[Z(n) • (Q* - Q(«))] and we have used the fact that the convex conjugate li of the von Neumann entropy 
is 1-strongly smooth [17], 

Our original assumption that Q(«) remains a bounded distance away from <3* means that D n is bounded away from 
zero; moreover, with R concave and smooth, we will also have tr[V(n) • (Q (n) - Q*)] < —m for some m > 0. Thus, 
telescoping (36) yields: 

A, + i < D 0 - t„ {m - 2" =1 w M ft) + ^ Y! j=x y) ll^0')|| 2 , (37) 

where t n = 7j an d w j,n = By the strong law of large numbers for martingale differences [18, Theorem 2.18], 
we have ir x Yj')=\ fj 0 (a.s.); hence, with y„+i/y„ < 1, Hardy’s Tauberian summability criterion [40, p. 58] applied 
to the weight sequence wj„ = jjlt n yields Y!]=\ w j,n€j “ > 0 (a.s.). Finally, since y n is square-summable and y,,Z(n) is 
a martingale difference with finite variance, it follows that EJjLj y 2 ||V(«)|| 2 < 00 (a.s.) by Theorem 6 in [41]. 

Combining all of the above, we see that the RHS of (37) tends to -00 (a.s.); this contradicts the fact that I)„ < 0, 
so we conclude that Q (n) visits a compact neighborhood of <3* infinitely often. Since <3* attracts any initial condition 
Y(0) under the continuous-time dynamics (MXL), Theorem 6.10 in [14] shows that QOz) converges to (3*. as claimed. 

For the bound (13), note that (36) can be rewritten as 

y« tr[(Q* - Q {n)) ■ V(Q(n))] < D n - D n+l + y n f n + iy 2 ||V(n)|| 2 , (38) 

so, recalling that R is concave and V = Vq R, we get: 

Jn [flmax " R„\ < Jn *[«** ~ Q («)) ■ V(Q(«))] 

< D„ - D n+ 1 + y„f„ + ^yf, ||V(n)|| 2 . (39) 

Thus, taking expectations on both sides and telescoping, we obtain: 

Z"„ VI I s - - 5 D « + l vl Zf vl 

14 This follows from the fact that the von Neumann entropy (27) is strongly convex with respect to the nuclear norm [17, 24]. 


(40) 
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where we have used the fact that E[£„] = 0 and the finite mean square hypothesis E[||V(n )|| 2 ] < v 2 . From (30), 
we have Dq = F( Q*,0) < maxQQ,j/i(Q) - h(Q')\ = log M, so (13) follows by rearranging (40) and solving for 
E [^«] = t~ l 2" =1 7j E [^y]- 

Moreover, for the large deviations bound (14), Eq. (39) yields /? max - R„ < s„ + t n 1 "Yj€j> so 


P (/?max - R„ > s„ + z) < P (z"=l > t„z) , 


(41) 


with e„ = t n 1 (log M + \ V 2 £" = i y 2 j defined as in (15). Since y^j is a martingale difference with finite conditional 


variance Var(y^) < Ayjcr 2 for some A > 0, Chebyshev’s inequality yields: 


P fel \y£i\ * tnt) Z ^2 Z"=I Vdr( r,tj) < 2 "=1 Z"=l ^ 2 ) ■ 


(42) 


Finally, if the conditional distribution of V(n) given Q (n - 1) is symmetric around V(Q(n - 1)), the conditional 
distribution of £ n will be symmetric around 0, so the bound (16) follows from the exponential concentration inequality 
(6.1) of [19]. Otherwise, under the modified hypothesis (H2'), the exponential bound (16) follows from Theorem 1.2A 
in [19]. ■ 


C. Variants ofMXL 


In this section, we prove the convergence of the variant exponential learning schemes MXL-a and MXL-eig (Algo¬ 
rithms 2 and 3 respectively). 

Proof of Theorem 2: We will show that the recursion (MXL-a) is an asynchronous stochastic approximation of 
(MXL-c) in the sense of [15, Chap. 7]. Indeed, by Theorems 2 and 3 in [15], the recursion (MXL-a) may be viewed as 
a stochastic approximation of the rate-adjusted dynamics 


y* = m Vk 

exp(Yi) 


(43) 


tr[exp(Y*)] 

where we have momentarily reinstated the user index k and ///. = n^/n > 0 denotes the update rate of user k 

(the existence and positivity of this limit follows from the ergodicity of the update process 'K„). This multiplicative 
factor does not alter the rest points and internally chain transitive (ICT) sets [14] of the dynamics (MXL-c), so (43) 
converges to arg max R from any initial condition and the proof of Theorem 1 carries through essentially verbatim. ■ 
To prove Theorem 3, we first need to derive the eigen-dynamics (23) induced by (MXL-c): 


Proposition 3. Let Q (t) be a solution orbit of (MXL-c) and let { q a (t ), u„(t)\ be a smooth eigen-decomposition ofQ(t). 
Then, {q a (f),u a (t)} is a solution of the eigen-dynamics (23). 

Proof: By differentiating the identity q a 6 a p = u . Qu/j, we readily obtain: 

q a 6ap = uQli/; + u 0 11 /: + ifQh/; 

= u^Qpg + (q a - qplulfip. 


(44) 
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where the last equality follows by differentiating the orthogonality condition u^Us = 6 a p. Thus, by a) taking a = fi and 
b) solving for u, K in (44), we respectively obtain: 


<7 


<* - u ,Qu (> 


ut = y 

/—ip 


u.Qii/ 


-‘Ptaqa-qp P 

However, by using the Frechet derivative of the matrix exponential [42], we readily get: 

Q= 1 ^exp(Y)-exp(Y) tr[VeXp(Y) 2 ] 
tr[exp(Y)] dt tr[exp(Y)]- 


(45a) 

(45b) 


-— r 

P(Y)] Jo 


tr[exp(Y)] 


exp((l - s)Y)Y exp( sY) ds - Qtr[VQ] 


f 


Q 1 f VQ* Js-Qtr[VQ], 


(46) 


and hence: 


ulQU/S 


'VQ'u/; ds - tr[VQ | • tfj.Qii/i 


[' ulQ 1 

Jo 

Qa Vaptfp ds — Cladap ^ ^ QyVyy-> 


where we have set V a p = u^Vuy?. Thus, by carrying out the integration in (47), we finally obtain: 

Qa ~ Qp 


<Qup 


- Vap Qa^ap Z, q y V T 


log qa — log qp ‘ - ' '—‘y 

with the convention (x - y )/(logx - log y) — x if x = y. Eq. (23) then follows by substituting (48) in (45). 
Proof of Proposition 1: Combining (MXL) and the derivative expression (46), we get: 

exp(Y(n + 1)) exp(Y(n) + y„V(n)) 


(47) 


(48) 


Q (« + 1) = 


tr[exp(Y(n + 1))] tr[exp(Y(n) + y„V(n))] 


= Q 00 + 7/ 


f 1 Q(«) 1_ 

Jo 


'V(n)Q (ri) s ds 


- y„ tr[Q(n)V(n)] • Q (n) + O (y 2 ||V(n)[| 2 ) , (49) 

where the term 0(y„ ||V(n)|| 2 ) is bounded from above by Cy 2 ||V(n)|| 2 for some constant C that does not depend on 
Q (ri). Since y„ —» 0 by assumption, Remark 4.5 in [14] shows that the quadratic error in (49) can be ignored in the 
long-run, so Q (n) is an APT of the dynamics (46). Hence, by Proposition 3, the eigen-decomposition {q a (n),u a (n)} is 
an APT of (23), as claimed. ■ 

Proof of Theorem 3: Consider the following Euler discretization of the eigen-dynamics (23): 


q n 


q a + y, 


nqa J ^ 


<- + y„ Y 


ijs 
Vp a 


-Us, 

J P* a log q a - log qp 

i.e. the update step of Alg. 3 without the orthonormalization correction for u,>. We then obtain: 


(50a) 

(50b) 


u l(n + 1) ■ up(n + 1) = u l(n) ■ u p(n) + <9(y 2 ). 


(51) 
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which shows that the orthonormalization correction in Alg. 3 is quadratic in y n . Thus, as long as y„ is chosen small 
enough (so that q a {n) > 0 for all n). Remark 4.5 in [14] shows that the iterates of Alg. 3 comprise an APT of (23). In 
turn, the same reasoning as in the proof of Prop. 1 can be used to show that Q (n) = q a {n)\i a {n)\^ a (ri) is an APT of 
(MXL-c), so Q (n) converges to the solution set of (RM) by Theorem 1. ■ 

D. The fast-fading regime 

Proof of Theorem 4: Let V erg = VR elg denote the gradient of the ergodic sum rate function R c ,„ and consider the 
dynamics: 

Y = V erg , 

= exp(Y) (52) 

tr[exp(Y)] 

The same reasoning as in the proof of Theorem 5 shows that (52) converges to the unique minimizer of the (strictly 
concave) sum rate maximization problem (ERM). Moreover, given that R is concave for any fixed channel matrix H 
and R erg is finite on <3, we have [43]: 

Verg = V Q R erg = Eh [VqR(Q)] = E H [V], (53) 

with V defined as in (11). With V bounded, it follows that V erg is Lipschitz, so Propositions 4.2 and 4.1 in [14] imply 
that the iterates of (MXL) with noisy measurements satisfying (HI) and (H2) comprise a stochastic approximation of 
the mean dynamics (52). The rest of the proof then follows as in the case of Theorem 1. ■ 
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